Cluster-Based Sampling Approaches to Imbalanced Data Distributions
Abstract
In classification problems, the training data significantly influence classification accuracy. When a data set is highly imbalanced, classification algorithms tend to degenerate by assigning all cases to the most common outcome. Hence, it is important to select suitable training data when the class distribution is imbalanced. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data to improve classification accuracy in an imbalanced class distribution environment, i.e., the PAKDD competition data set. The CART (Classification and Regression Tree) classification algorithm is considered. The experimental results show that our cluster-based under-sampling approaches can outperform the traditional approaches.
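The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way cluster-based under-sampling could work, not the authors' method. It assumes the majority class is grouped with plain k-means and that each cluster contributes a number of samples proportional to its size. The function names `kmeans` and `cluster_undersample` and the parameters `n_keep` and `k` are hypothetical, and the CART training step is left out.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct positions in the list
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority points, spread across clusters by cluster size.

    Assumption: proportional allocation per cluster. The returned count can
    differ from n_keep by a few points because of rounding.
    """
    rng = random.Random(seed)
    assign = kmeans(majority, k, seed=seed)
    clusters = [[p for p, a in zip(majority, assign) if a == c] for c in range(k)]
    kept = []
    for members in clusters:
        if not members:
            continue
        # Quota proportional to cluster size, with at least one point per cluster.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Synthetic majority class: 180 two-dimensional points around three centers.
    majority = [
        (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (5, 5), (0, 5)]
        for _ in range(60)
    ]
    reduced = cluster_undersample(majority, n_keep=20, k=3)
    print(len(majority), "majority points reduced to", len(reduced))
```

The reduced majority set would then be combined with all minority samples to train the classifier. Sampling from every cluster, rather than uniformly at random, is meant to keep each region of the majority class represented in the smaller training set.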
Similar Resources
Cluster-based under-sampling approaches for imbalanced data distributions
In classification problems, the training data significantly influence classification accuracy. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data are in the majority class and few are in the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most of the incomin...
Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset
The most important factor for improving classification accuracy is the training data. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data are in the majority class and few are in the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most of the incomin...
Imbalanced Data SVM Classification Method Based on Cluster Boundary Sampling and DT-KNN Pruning
This paper presents an SVM classification method based on cluster boundary sampling and sample pruning. We explore an effective solution to the difficult problem of imbalanced data set classification through both data re-sampling and algorithm improvement. First, we propose the method of cluster boundary sampling, using the clustering density threshold and the boundary density ...
Cluster-based Sampling and Ensemble for Bleeding Detection in Capsule Endoscopy Videos
We present a cluster-based sampling and ensemble method to learn from a large, imbalanced data set for bleeding detection in capsule endoscopy (CE) videos. Our method selects training examples randomly according to the data distributions derived from clustering. Multiple training sets are created such that data balance is restored. The sampling probability is proportional to the cluster distribution, and within each...
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used to train the clusters is not well distributed. This uneven distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...